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Abstract 

We  introduce  a  randomized  algorithm  for  overdeternrined  linear  least-squares  re¬ 
gression.  Given  an  arbitrary  full-rank  m  x  n  matrix  A  with  m  >  n,  any  m  x  1  vector 
b,  and  any  positive  real  number  e,  the  procedure  computes  an  n  x  1  vector  x  which 
minimizes  the  spectral  norm  \\Ax  —  6||  to  relative  precision  e.  The  algorithm  typically 
requires  O  ((log(n)  +  log(l/e))  m  n  +  ?r3)  floating-point  operations.  This  cost  is  less 
than  the  0(mn2)  required  by  the  classical  schemes  based  on  Qi?-decompositions  or 
bidiagonalization.  We  present  several  numerical  examples  illustrating  the  performance 
of  the  algorithm. 


1  Introduction 

Least-squares  fitting  has  permeated  the  sciences  and  engineering  following  its  introduction 
over  two  centuries  ago  (see,  for  example,  [3]  for  a  brief  historical  review).  Linear  least- 
squares  regression  is  fundamental  in  the  analysis  of  data,  such  as  that  generated  from  biology, 
econometrics,  engineering,  physics,  and  many  other  technical  disciplines. 

Perhaps  the  most  commonly  encountered  formulation  of  linear  least-squares  regression 
involves  a  full-rank  m  x  n  matrix  A  and  an  mxl  column  vector  b,  with  m  >  n;  the  task  is  to 
find  an  n  x  1  column  vector  x  such  that  the  spectral  norm  ||Aa;  —  6||  is  minimized.  Classical 
algorithms  using  Q-R-decompositions  or  bidiagonalization  require 

^classical  =  0(j7lTl  )  (1) 

floating-point  operations  in  order  to  compute  x  (see,  for  example,  [3]  or  Chapter  5  in  [5]). 
The  present  paper  introduces  a  randomized  algorithm  that,  given  any  positive  real  number 
e,  computes  a  vector  x  minimizing  \\Ax  —  6||  to  relative  precision  e,  that  is,  the  algorithm 
produces  a  vector  x  such  that 

\\A*  ~  b\\  ~  min  II AV  ~  b\\  <  £  min  ||  Ay-  b\\.  (2) 

y£<L  ye<L 


♦Partially  supported  by  ONR  Grant  #N00014-07-l-0711,  DARPA  Grant  #FA9550-07-l-0541,  and  NGA 
Grant  #HM1582-06-l-2039. 


1 


Report  Documentation  Page 


Form  Approved 
OMB  No.  0704-0188 


Public  reporting  burden  for  the  collection  of  information  is  estimated  to  average  1  hour  per  response,  including  the  time  for  reviewing  instructions,  searching  existing  data  sources,  gathering  and 
maintaining  the  data  needed,  and  completing  and  reviewing  the  collection  of  information.  Send  comments  regarding  this  burden  estimate  or  any  other  aspect  of  this  collection  of  information, 
including  suggestions  for  reducing  this  burden,  to  Washington  Headquarters  Services,  Directorate  for  Information  Operations  and  Reports,  1215  Jefferson  Davis  Highway,  Suite  1204,  Arlington 
VA  22202-4302.  Respondents  should  be  aware  that  notwithstanding  any  other  provision  of  law,  no  person  shall  be  subject  to  a  penalty  for  failing  to  comply  with  a  collection  of  information  if  it 
does  not  display  a  currently  valid  OMB  control  number. 


1.  REPORT  DATE 

28  APR  2008 


2.  REPORT  TYPE 


3.  DATES  COVERED 

00-00-2008  to  00-00-2008 


4.  TITLE  AND  SUBTITLE 

A  fast  randomized  algorithm  for  overdetermined  linear  least-squares 
regression 

6.  AUTHOR(S) 


7.  PERFORMING  ORGANIZATION  NAME(S)  AND  ADDRESS(ES) 

Yale  University, Department  of  Computer  Science  ,New  Ha ven,CT, 06520 

9.  SPONSORING/MONITORING  AGENCY  NAME(S)  AND  ADDRESS(ES) 


5a.  CONTRACT  NUMBER 


5b.  GRANT  NUMBER 

5c.  PROGRAM  ELEMENT  NUMBER 

5d.  PROIECT  NUMBER 


5e.  TASK  NUMBER 


5f.  WORK  UNIT  NUMBER 

8.  PERFORMING  ORGANIZATION 
REPORT  NUMBER 


10.  SPONSOR/MONITOR'S  ACRONYM(S) 

11.  SPONSOR/MONITOR'S  REPORT 
NUMBER(S) 


12.  DISTRIBUTION/AVAILABILITY  STATEMENT 

Approved  for  public  release;  distribution  unlimited 


13.  SUPPLEMENTARY  NOTES 


14.  ABSTRACT 

see  report 


15.  SUBIECT  TERMS 

16.  SECURITY  CLASSIFICATION  OF: 


a.  REPORT 

unclassified 


b.  ABSTRACT 

unclassified 


c.  THIS  PAGE 

unclassified 


17.  LIMITATION  OF 

18.  NUMBER 

ABSTRACT 

OF  PAGES 

Same  as 

14 

Report  (SAR) 

19a.  NAME  OF 
RESPONSIBLE  PERSON 


Standard  Form  298  (Rev.  8-98) 

Prescribed  by  ANSI  Std  Z39-18 


This  algorithm  typically  requires 

Grand  =  O  ((log(ra)  +  log(l/£))  mn  +  n3)  (3) 

operations.  When  n  is  sufficiently  large  and  m  is  much  greater  than  n  (that  is,  the  regression 
is  highly  overdetermined),  then  (3)  is  less  than  (1).  Furthermore,  in  the  numerical  exper¬ 
iments  of  Section  6,  the  algorithm  of  the  present  article  runs  substantially  faster  than  the 
standard  methods  based  on  ^^-decompositions. 

The  method  of  the  present  article  is  an  extension  of  the  methods  introduced  in  [7],  [8], 
and  [4],  Their  algorithms  and  ours  have  similar  costs;  however,  for  the  computation  of  x 
minimizing  ||7la;  —  6||  to  relative  precision  e,  the  earlier  algorithms  involve  costs  proportional 
to  1/e,  whereas  the  algorithm  of  the  present  paper  involves  a  cost  proportional  to  log(l/e) 
(see  (3)  above). 

The  present  article  describes  algorithms  optimized  for  the  case  when  the  entries  of  A 
and  b  are  complex  valued.  Needless  to  say,  real-valued  versions  of  our  schemes  are  similar. 
This  paper  has  the  following  structure:  Section  2  sets  the  notation.  Section  3  discusses  a 
randomized  linear  transformation  which  can  be  applied  rapidly  to  arbitrary  vectors.  Section  4 
provides  the  relevant  mathematical  apparatus.  Section  5  describes  the  algorithm  of  the 
present  paper.  Section  6  illustrates  the  performance  of  the  algorithm  via  several  numerical 
examples.  Section  7  draws  conclusions  and  proposes  directions  for  future  work. 


2  Notation 


In  this  section,  we  set  notational  conventions  employed  throughout  the  present  paper. 

We  denote  an  identity  matrix  by  1.  We  consider  the  entries  of  all  matrices  in  this  paper 
to  be  complex  valued.  For  any  matrix  A,  we  define  A*  to  be  the  adjoint  of  A,  and  we  define 
the  norm  ||yl||  of  A  to  be  the  spectral  (/2-operator)  norm  of  A,  that  is,  ||7l||  is  the  greatest 
singular  value  of  A.  We  define  the  condition  number  of  A  to  be  the  l2  condition  number  of 
A,  that  is,  the  greatest  singular  value  of  A  divided  by  the  least  singular  value  of  A.  If  A  has 
at  least  as  many  rows  as  columns,  then  the  condition  number  of  A  is  given  by  the  expression 

ka  =  n/PMII  ||(AM)-1||.  (4) 

For  any  positive  integers  m  and  n  with  m  >  n,  and  any  m  x  n  matrix  A,  we  will  be  using 
the  singular  value  decomposition  of  A  in  the  form 


A  —  tt  y  V* 

x n  is m. X n  ^n. X n.  *  n 


ymxn  ^nxn  v  nxm 


(5) 


where  U  is  an  m  x  n  matrix  whose  columns  are  orthonormal,  V  is  an  n  x  n  matrix  whose 
columns  are  orthonormal,  and  E  is  a  diagonal  n  x  n  matrix  whose  entries  are  all  nonnega¬ 
tive.  We  abbreviate  “singular  value  decomposition”  to  “SVD”  and  “independent,  identically 
distributed”  to  “i.i.d.” 

For  any  positive  integer  m,  we  define  the  discrete  Fourier  transform  F ^  to  be  the 
complex  m  x  m  matrix  with  the  entries 

(F(m))jifc  =  _L  e-2ni(j-l){k-l)/m  (6) 

A /TO 
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for  j,  k  =  1,  2,  . . . ,  m  —  1,  m,  where  i  =  \/—l  and  e  =  exp(l);  if  the  size  m  is  clear  from 
the  context,  then  we  omit  the  superscript  in  F denoting  the  discrete  Fourier  transform 
by  simply  F. 


3  Preliminaries 


In  this  section,  we  discuss  a  subsampled  randomized  Fourier  transform.  [1],  [4],  [7],  and  [8] 
introduced  a  similar  transform  for  similar  purposes. 

For  any  positive  integers  l  and  m  with  /  <  m,  we  define  the  l  x  m  SRFT  to  be  the  l  x  m 
random  matrix 

Tlxm  Crixm.  Hmxmi  (7) 

where  G  and  FI  are  defined  as  follows. 

In  (7),  G  is  the  l  x  m  random  matrix  given  by  the  formula 


Glxm  —  Sj 


i  f  r> 

Lxm  -*■  mxm  ^mxmi 


(8) 


where  S  is  the  l  x  m  matrix  whose  entries  are  all  zeros,  aside  from  a  single  1  in  column  Sj  of 
row  j  for  j  =  1,  2,  . . . ,  /  —  1,  /,  where  si,  •  •  • ,  s/-i,  si  are  i.i.d.  integer  random  variables, 
each  distributed  uniformly  over  {1,2,  ...,m  —  1  ,  m};  moreover,  F  is  the  m  x  m  discrete 
Fourier  transform,  and  D  is  the  diagonal  m  x  m  matrix  whose  diagonal  entries  d{,  d2,  . . . , 
dm- 1,  dm,  are  i.i.d.  complex  random  variables,  each  distributed  uniformly  over  the  unit  circle. 
(In  our  numerical  implementations,  we  drew  si,  s2,  . Sj_i,  S[  from  {1,2, ...  ,m  —  1  ,m} 
without  replacement,  instead  of  using  i.i.d.  draws.) 

In  (7),  H  is  the  m  x  m  random  matrix  given  by  the  formula 


Hn 


=  0 


n 


0 


mxm  ±±mxm  ^mxm  ^mxm  xxmxm  ^mxmi 


(9) 


where  II  and  II  are  m  x  m  permutation  matrices  chosen  independently  and  uniformly  at 
random,  and  Z  and  Z  are  diagonal  m  x  m  matrices  whose  diagonal  entries  do  C'n  •  ■  • , 
Cm- 1,  Cm  and  Cl,  C2,  •  •  • ,  Cm— 1,  Cm  are  i  i  d.  complex  random  variables,  each  distributed 
uniformly  over  the  unit  circle;  furthermore,  0  and  0  are  the  m  x  m  matrices  defined  via  the 
formulae 
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and  (the  same  as  (10),  but  with  tildes) 
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.d.  real  random  variables  drawn 

uniformly  from  [0,2n\.  We  observe  that  0,  0,  n,  II,  Z ,  and  Z  are  all  unitary,  and  so  H  is 
also  unitary. 

We  call  the  transform  T  an  “SRFT’:  for  lack  of  a  better  term. 


Remark  3.1  The  earlier  works  [6]  and  [9]  omitted  the  matrix  H  in  the  definition  (7)  of  the 
SRFT.  Numerical  experiments  indicate  that  including  H  improves  the  performance  of  the 
algorithm  of  the  present  paper  on  sparse  matrices. 

The  following  lemma  is  similar  to  the  subspace  Johnson- Lindenstrauss  lemma  (Corollary 
11)  of  [8],  and  is  proven  (in  a  slightly  different  form)  as  Lemma  4.4  of  [9].  The  lemma 
provides  a  highly  probable  upper  bound  on  the  condition  number  of  the  product  of  the  l  x  m 
matrix  G  defined  in  (8)  and  an  independent  m  x  n  matrix  U  whose  columns  are  orthonormal, 
assuming  that  l  is  less  than  m  and  is  sufficiently  greater  than  n2. 

Lemma  3.2  Suppose  that  a  and  (3  are  real  numbers  greater  than  l,  and  l,  m,  and  n  are 
positive  integers,  such  that 

OL  +  1  \ 

a  —  1 J 

Suppose  further  that  G  is  the  l  x  m  random  matrix  defined  in  (8).  Suppose  in  addition  that 
U  is  an  m  x  n  random  matrix  whose  columns  are  orthonormal,  and  that  U  is  independent 
ofG. 

Then,  the  condition  number  of  GU  is  at  most  yW  with  probability  at  least  1  —  1. 

The  following  corollary  of  Lemma  3.2  follows  immediately  from  the  fact  that  the  random 
matrix  H  defined  in  (9)  is  unitary  and  independent  of  the  random  matrix  G  defined  in  (8). 
The  corollary  provides  a  highly  probable  upper  bound  on  the  condition  number  of  the  l  x  m 
SRFT  (defined  in  (7))  applied  to  an  m  x  n  matrix  U  whose  columns  are  orthonormal, 
assuming  that  l  is  less  than  m  and  is  sufficiently  greater  than  n2 . 


pn2 


(12) 


m  >  l  > 


4 


Corollary  3.3  Suppose  that  a  and  /3  are  real  numbers  greater  than  1,  and  l,  m,  and  n  are 
positive  integers,  such  that  (12)  holds.  Suppose  further  that  T  is  the  l  x  m  SRFT  defined 
in  (7).  Suppose  in  addition  that  U  is  an  m  x  n  matrix  whose  columns  are  orthonormal. 

Then,  the  condition  number  ofTU  is  at  most  y/a  with  probability  at  least  1  — 

The  following  lemma  states  that,  if  A  is  an  m  x  n  matrix,  b  is  an  m  x  1  vector,  and  T 
is  the  l  x  m  SRFT  defined  in  (7),  then,  with  high  probability,  an  n  x  1  vector  z  minimizing 
|| T  Az  —  T  6||  also  minimizes  || Az  —  6||  to  within  a  reasonably  small  factor.  Whereas  solving 
Az  ~  b  in  the  least-squares  sense  involves  m  simultaneous  linear  equations,  solving  T Az  ~ 
T  b  involves  just  l  simultaneous  equations.  This  lemma  is  modeled  after  similar  results  in  [4], 
[7],  and  [8],  and  is  proven  (in  a  slightly  different  form)  as  Lemma  4.8  of  [9]. 

Lemma  3.4  Suppose  that  a  and  (3  are  real  numbers  greater  than  1,  and  l,  m,  and  n  are 
positive  integers,  such  that 


m  >  l  >  /3(n  +  l)2.  (13) 

Suppose  further  that  T  is  the  l  x  m  SRFT  defined  in  (7).  Suppose  in  addition  that  A  is  an 
m  x  n  matrix,  and  b  is  an  mx  1  vector.  Suppose  finally  that  z  is  an  nx  1  vector  minimizing 
the  quantity 

\\TAz-Tb\\.  (14) 

Then, 

\\Az-b\\  <  min  \\Ay-b\\.  (15) 


4  Mathematical  apparatus 

In  this  section,  we  prove  a  theorem  which  (in  conjunction  with  Corollary  3.3)  guarantees 
that  the  algorithm  of  the  present  paper  is  fast. 

In  the  proof  of  Theorem  4.2  below,  we  will  need  the  following  technical  lemma. 


Lemma  4.1  Suppose  that  l,  rri,  and  n  are  positive  integers  such  that  m  >  l  >  n.  Suppose 
further  that  A  is  an  m  x  n  matrix,  and  that  the  SVD  of  A  is 


A  —  TT  y  V* 

mxn  ^mxn  ^nxn  v n 


(16) 


where  U  is  an  m  x  n  matrix  whose  columns  are  orthonormal,  V  is  an  n  x  n  matrix  whose 
columns  are  orthonormal,  and  S  is  a  diagonal  nxn  matrix  whose  entries  are  all  nonnegative. 
Suppose  in  addition  that  T  is  an  l  x  m  matrix,  and  that  the  SVD  of  the  l  x  n  matrix  TU  is 

Tlxm.  Urnxn  C;Xn  Tinxn  Vnxn 


(17) 


Then,  there  exist  an  n  x  n  matrix  P,  and  anlx  n  matrix  Q  whose  columns  are  orthonor¬ 
mal,  such  that 


Tlxm  AmXn  —  Qlxn  Pn 


(18) 
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Furthermore,  if  P  is  any  n  x  n  matrix,  and  Q  is  any  l  x  n  matrix  whose  columns  are 
orthonormal,  such  that  P  and  Q  satisfy  (18),  then 


Pnxn  =  ( Qlxn)*  Ulxn  t 


nxn  v  nxn  ^nxn  v  nxro 


(19) 


if,  in  addition,  the  matrices  A  and  TU  both  have  full  rank  (rank  n),  then  there  exists  a 
unitary  n  x  n  matrix  W  such  that 

Ulxn  =  Qlxn  kFnxn-  (20) 


Proof.  An  example  of  matrices  P  and  Q  satisfying  (18)  such  that  the  columns  of  Q  are 
orthonormal  is  P  =  E  V*  E  V*  and  Q  —  U . 

We  now  assume  that  P  is  any  n  x  n  matrix,  and  Q  is  any  l  x  n  matrix  whose  columns 
are  orthonormal,  such  that  P  and  Q  satisfy  (18).  Combining  (18),  (16),  and  (IT)  yields 


Qlxn  Pnxn  Ui 


xn  '-'nxn  v  nxn  °nxn  v  nxn- 


(21) 


Combining  (21)  and  the  fact  that  the  columns  of  Q  are  orthonormal  (so  that  Q*  Q  —  1) 
yields  (19). 

For  the  remainder  of  the  proof,  we  assume  that  the  matrices  A  and  T  U  both  have  full 
rank.  To  establish  (20),  we  demonstrate  that  the  column  spans  of  Q  and  U  are  the  same.  It 
then  follows  from  the  fact  that  the  columns  of  Q  are  an  orthonormal  basis  for  this  column 
span,  as  are  the  columns  of  U,  that  there  exists  a  unitary  nxn  matrix  W  satisfying  (20). 
We  now  complete  the  proof  by  showing  that 


column  span  of  Qixn  =  column  span  of  Uixn 


(22) 


Obviously,  it  follows  from  (21)  that 

column  span  of  Qixn  Pnxn  =  column  span  of  Utxn  Erexn  V*xn  Enxn  V*yn'  (23) 


we  will  simplify  both  sides  of  (23),  in  order  to  obtain  (22). 

It  follows  from  the  fact  that  A  and  T  U  both  have  full  rank  that  the  matrices  E  and  E  in 
the  SVDs  (16)  and  (17)  are  nonsingular,  and  so  (as  the  unitary  matrices  V  and  V  are  also 
nonsingular) 

column  span  of  Ulxn  tnxn  V*xn  Enxn  V*xn  =  column  span  of  Uixn.  (24) 


Combining  (23),  (24),  and  the  fact  that  the  column  span  of  U  is  n-dimensional  (after  all, 
the  n  columns  of  U  are  orthonormal)  yields  that  the  column  span  of  Q  P  is  n-dimensional. 
Combining  this  fact,  the  fact  that  the  column  span  of  Q  P  is  a  subspace  of  the  column  span 
of  Q,  and  the  fact  that  the  column  span  of  Q  is  n-dimensional  (after  all,  the  n  columns  of 
Q  are  orthonormal)  yields 


column  span  of  Qixn  Pnxn  —  column  span  of  Qixn 


(25) 


Combining  (23),  (24),  and  (25)  yields  (22). 


□ 


The  following  theorem  states  that,  given  an  m  x  n  matrix  A,  the  condition  number  of 
a  certain  preconditioned  version  of  A  corresponding  to  an  l  x  m  matrix  T  is  equal  to  the 
condition  number  of  T  U,  where  U  is  an  m  x  n  matrix  of  orthonormal  left  singular  vectors 
of  A. 
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Theorem  4.2  Suppose  that  l,  m,  and  n  are  positive  integers  such  that  m  >  l  >  n.  Suppose 
further  that  A  is  a  full-rank  m  x  n  matrix,  and  that  the  SVD  of  A  is 


A  —  TT  y  V* 
mxn  ^  mxn  ^nxn  v n 


(26) 


Suppose  in  addition  that  T  is  an  l  x  m  matrix  such  that  the  l  x  n  matrix  T  U  has  full  rank. 

Then,  there  exist  an  n  x  n  matrix  P,  and  anlxn  matrix  Q  whose  columns  are  orthonor¬ 
mal,  such  that 


l / X :n  Amxn  Qixn  Pnxn- 


(27) 


Furthermore,  if  P  is  any  n  x  n  matrix,  and  Q  is  any  l  x  n  matrix  whose  columns  are 
orthonormal,  such  that  P  and  Q  satisfy  (27),  then  the  condition  numbers  of  APl  and  TU 
are  equal. 


Proof.  Lemma  4.1  guarantees  the  existence  of  matrices  P  and  Q  satisfying  (27)  such  that 
the  columns  of  Q  are  orthonormal. 

For  the  remainder  of  the  proof,  we  assume  that  P  is  an  n  x  n  matrix,  and  Q  is  an  l  x  n 
matrix  whose  columns  are  orthonormal,  such  that  P  and  Q  satisfy  (27).  Combining  (19), 
(20),  and  the  fact  that  the  columns  of  Q  are  orthonormal  (so  that  Q*  Q  —  1)  yields 


nxn  ^»xn  v  nxn^nxn  v  nxm 


(28) 


where  W  is  the  matrix  from  (20),  and  E  and  V  are  the  matrices  from  the  SVD  (17).  Com¬ 
bining  (26),  (28),  and  the  fact  that  V,  V,  and  W  are  unitary  yields 


A  P~1  —  TT 

^-mxn  A  nxn  ^ 77 


V, 


y-i  u/* 
nxn  ^ nxn  vtnxn' 


(29) 


Combining  (29),  the  fact  that  V  and  W  are  unitary,  the  fact  that  the  columns  of  U  are 
orthonormal  (so  that  U*  U  =  1),  and  the  fact  that  E  is  diagonal  yields 


and 

Combining  (4),  (30), 


||  (Amxn  Pnxn)  (Anxn-P, 


3-1 


-1  II 2 


nxnJ  II  —  II E  nxn 


(( AmxnPnxn )  (4mxn  Pnxn)) 


=  II  Er 


(31),  and  the  SVD  (17)  yields  the  present  theorem. 


(30) 

(31) 
□ 


5  The  algorithm 

In  this  section,  we  describe  the  algorithm  of  the  present  paper,  giving  details  about  its 
implementation  and  computational  costs. 
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5.1  Description  of  the  algorithm 

Suppose  that  m  and  n  are  positive  integers  with  m  >  n,  A  is  a  full-rank  mxn  matrix,  and  b 
is  an  m  x  1  column  vector.  In  this  subsection,  we  describe  a  procedure  for  the  computation 
of  an  n  x  1  column  vector  x  such  that  \\Ax  —  6||  is  minimized  to  arbitrarily  high  precision. 

Rather  than  directly  calculating  the  vector  x  minimizing  ||^4x  — 6||,  we  will  first  calculate 
the  vector  y  minimizing  \\C y  —  6||,  where  C  =  AP -1  and  y  —  P x,  with  an  appropriate 
choice  of  an  n  x  n  matrix  P;  the  matrix  P  is  known  as  a  preconditioning  matrix.  With  an 
appropriate  choice  of  P,  the  condition  number  of  C  is  reasonably  small,  and  so  an  iterative 
solver  such  as  the  conjugate  gradient  method  will  require  only  a  few  iterations  in  order  to 
obtain  y  minimizing  \\C y  —  6||  to  high  precision.  Once  we  have  calculated  y,  we  obtain  x  via 
the  formula  x  =  P~ly- 

To  construct  the  preconditioning  matrix  P,  we  compute  Y  =  T  A,  where  T  is  the  l  x  m 
SRFT  defined  in  (7),  with  m  >  l  >  n.  We  then  form  a  pivoted  QP-decomposition  of  Y, 
computing  an  l  x  n  matrix  Q  whose  columns  are  orthonormal,  an  upper-triangular  n  x  n 
matrix  R,  and  an  n  x  n  permutation  matrix  II,  such  that  Y  =  0  P 1 1 .  We  use  the  product 
P  =  PII  as  the  preconditioning  matrix.  Fortuitously,  since  this  matrix  P  is  the  product  of 
an  upper-triangular  matrix  and  a  permutation  matrix,  we  can  apply  P_1  or  (P^1)*  to  any 
arbitrary  vector  rapidly,  without  calculating  the  entries  of  P_1  explicitly. 

The  condition  number  of  C  —  AP is  reasonably  small  with  very  high  probability 
whenever  /  is  sufficiently  greater  than  n,  due  to  Theorem  4.2  and  Corollary  3.3;  moreover, 
numerical  experiments  in  Section  6  suggest  that  the  condition  number  of  C  is  practically 
always  less  than  3  or  so  when  /  =  4n.  Therefore,  when  l  is  sufficiently  greater  than  n,  the 
conjugate  gradient  method  requires  only  a  few  iterations  in  order  to  compute  y  minimizing 
\\C y  —  b\\  to  high  precision;  furthermore,  the  conjugate  gradient  method  requires  only  appli¬ 
cations  of  A,  A*,  P_1,  and  (P-1)*  to  vectors,  and  all  of  these  matrices  are  readily  available 
for  application  to  vectors.  Once  we  have  calculated  y,  we  obtain  x  minimizing  \[Ax  —  6||  via 
the  formula  x  =  P~ly ■ 

There  is  a  natural  choice  for  the  starting  vector  of  the  conjugate  gradient  iterations. 
Combining  the  fact  that  Y  =  T  A  with  (15)  yields  that,  with  high  probability,  the  n  x  1 
vector  z  minimizing  || Y z  —  Tb\\  also  minimizes  ||  Az  —  b\\  to  within  a  factor  of  3,  provided  l 
is  sufficiently  greater  than  n  (in  practice,  /  =  4 n  is  sufficient).  Thus,  z  is  a  good  choice  for 
the  starting  vector.  Moreover,  combining  the  facts  that  Y  =  Q  P  and  that  the  columns  of  Q 
are  orthonormal  yields  that  z  =  P~1  Q*  T  b ,  providing  a  convenient  means  of  computing  z. 

In  summary,  if  £  is  any  specified  positive  real  number,  we  can  compute  an  n  x  1  column 
vector  x  which  minimizes  \[Ax  —  6||  to  relative  precision  £  via  the  following  five  steps: 

1.  Compute  Y  =  T  A,  where  T  is  the  /  x  m  SRFT  defined  in  (7),  with  m  >  l  >  n.  (See, 
for  example,  Subsection  3.3  of  [9]  for  details  on  applying  the  SRFT  rapidly.) 

2.  Form  a  pivoted  QP-decomposition  of  Y  from  Step  1,  computing  an  l  x  n  matrix  Q 
whose  columns  are  orthonormal,  an  upper-triangular  n  x  n  matrix  R,  and  an  n  x  n 
permutation  matrix  14,  such  that  Y  =  Q  R II.  (See,  for  example,  Chapter  5  in  [5]  for 
details  on  computing  such  a  pivoted  Q R- decomposi t ion . ) 

3.  Compute  the  n  x  1  column  vector  z  =  P~1(Q*(T b)),  where  T  is  the  l  x  m  SRFT 
defined  in  (7),  Q  is  the  /  x  n  matrix  from  Step  2  whose  columns  are  orthonormal,  and 


P  —  R II;  R  and  II  are  the  upper-triangular  and  permutation  matrices  from  Step  2. 
(See,  for  example,  Subsection  3.3  of  [9]  for  details  on  applying  the  SRFT  rapidly.) 

4.  Compute  an  n  x  1  column  vector  y  which  minimizes  ||  A  P~1  y  —  5||  to  relative  precision 
e,  via  the  preconditioned  conjugate  gradient  iterations,  where  P  =  R II  is  the  precon¬ 
ditioning  matrix;  R  and  fl  are  the  upper-triangular  and  permutation  matrices  from 
Step  2.  Use  z  from  Step  3  as  the  starting  vector.  (See,  for  example,  Algorithm  7.4.3 
in  [3]  for  details  on  the  preconditioned  conjugate  gradient  iterations  for  linear  least- 
squares  problems.) 

5.  Compute  x  =  P_1  y,  where  y  is  the  vector  from  Step  4,  and  again  P  =  RYL,  R  and  fl 
are  the  upper-triangular  and  permutation  matrices  from  Step  2. 

5.2  Cost 

In  this  subsection,  we  estimate  the  number  of  floating-point  operations  required  by  each  step 
of  the  algorithm  of  the  preceding  subsection. 

We  denote  by  k  the  condition  number  of  the  preconditioned  matrix  A  P_1.  The  five  steps 
of  the  algorithm  incur  the  following  costs: 

1.  Applying  T  to  every  column  of  A  costs  0{mn  log(/)). 

2.  Computing  the  pivoted  QP-decomposition  of  Y  costs  0{n2  l). 

3.  Applying  T  to  b  costs  0(m  log(/)).  Applying  Q*  to  T  b  costs  0(nl).  Applying  P-1  = 
II"1#"1  to  Q*Tb  costs  0(n2). 

4.  When  l  >  4 n2,  (15)  guarantees  with  high  probability  that  the  vector  z  has  a  residual 
\\Az  —  b ||  that  is  no  greater  than  3  times  the  minimum  possible.  When  started  with 
such  a  vector,  the  preconditioned  conjugate  gradient  algorithm  requires  0(k  log(l/ er)) 
iterations  in  order  to  improve  the  relative  precision  of  the  residual  to  £  (see,  for  ex¬ 
ample,  formula  7.4.7  in  [3]).  Applying  A  and  A*  a  total  of  0(k  log(l/e))  times  costs 
0(mnK  log(l/e)).  Applying  P_1  and  (P-1)*  a  total  of  0(k  log(l/e))  times  costs 
0(n2  k  log(l/e)).  These  costs  dominate  the  costs  of  the  remaining  computations  in 
the  preconditioned  conjugate  gradient  iterations. 

5.  Applying  P-1  =  LU1  P”1  to  y  costs  0(n2). 

Summing  up  the  costs  in  the  five  steps  above,  we  see  that  the  cost  of  the  entire  algorithm  is 

^theoretical  =  O  ((log(Z)  +  K  log(l/e))  777,71  +  7T,2  l)  .  (32) 

The  condition  number  n  of  the  preconditioned  matrix  AP can  be  made  arbitrarily  close 
to  1,  by  choosing  l  sufficiently  large.  According  to  Theorem  4.2  and  Corollary  3.3,  choosing 
l  >  4n 2  guarantees  that  k  is  at  most  3,  with  high  probability. 
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Remark  5.1  Currently,  our  estimates  require  that  l  be  at  least  4n2  in  order  to  ensure  with 
high  probability  that  k  is  at  most  3  and  that  the  residual  ||Az  — 6||  is  no  greater  than  3  times 
the  minimum  possible.  However,  our  numerical  experiments  indicate  that  it  is  not  necessary 
for  l  to  be  as  large  as  4n2  (though  it  is  sufficient).  Indeed,  in  all  of  our  tests,  choosing  /  =  An 
produced  a  condition  number  k  less  than  3  and  a  residual  \\Az  —  6||  no  greater  than  3  times 
the  minimum  possible  residual.  With  /  =  4n  and  k  <  3,  the  cost  (32)  becomes 

Ctypicai  =  O  ((log(n)  +  log(l/e))  mn  +  n3)  .  (33) 

6  Numerical  results 

In  this  section,  we  describe  the  results  of  several  numerical  tests  of  the  algorithm  of  the 
present  paper. 

We  use  the  algorithm  to  compute  an  n  x  1  vector  x  minimizing  —  6||  to  high  precision, 
where  b  is  an  m  x  1  vector,  and  A  is  the  m  x  n  matrix  defined  via  the  formula 

Amxn  =  Umxn  hnxn  h nxn'i  (34) 

in  all  experiments  reported  below,  U  is  obtained  by  applying  the  Gram-Schmidt  process  to 
the  columns  of  an  m  x  n  matrix  whose  entries  are  i.i.d.  centered  complex  Gaussian  random 
variables,  V  is  obtained  by  applying  the  Gram-Schmidt  process  to  the  columns  of  an  n  x  n 
matrix  whose  entries  are  i.i.d.  centered  complex  Gaussian  random  variables,  and  E  is  a 
diagonal  n  x  n  matrix,  with  the  diagonal  entries 

=  i0-6(fc-i)/G-i)  (35) 

for  k  —  1,  2,  . . , ,  n  —  1,  n.  Clearly,  the  condition  number  ka  of  A  is 

KA  =  Sl.l/Sn.n  =  10®.  (36) 

The  m  x  1  unit  vector  b  is  defined  via  the  formula 

b  =  10~3w  +  Ax,  (37) 

where  w  is  a  random  m  x  1  unit  vector  orthogonal  to  the  column  span  of  A,  and  Ax  is  a 
vector  from  the  column  span  of  A  such  that  ||6||  =  1. 

We  implemented  the  algorithm  in  Fortran  77  in  double-precision  arithmetic,  and  used  the 
Lahey /Fujitsu  Express  v6.2  compiler.  We  used  one  core  of  a  1.86  GHz  Intel  Centrino  Core 
Duo  microprocessor  with  1  GB  of  RAM.  For  the  direct  computations,  we  used  the  classical 
algorithm  for  pivoted  QR-decompositions  based  on  plane  (Householder)  reflections  (see,  for 
example,  Chapter  5  in  [5]). 

Table  1  displays  timing  results  with  m  =  32768  for  various  values  of  n;  Table  2  displays 
the  corresponding  errors.  Table  3  displays  timing  results  with  n  =  256  for  various  values  of 
m;  Table  4  displays  the  corresponding  errors. 

The  headings  of  the  tables  are  as  follows: 

•  m  is  the  number  of  rows  in  the  matrix  A,  as  well  as  the  length  of  the  vector  b,  in 
||  Ax  —  b\\. 
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n  is  the  number  of  columns  in  the  matrix  A,  as  well  as  the  length  of  the  vector  x,  in 
||  Ax  —  6||. 


•  l  is  the  number  of  rows  in  the  matrix  T  used  in  Steps  1  and  3  of  the  procedure  of 
Subsection  5.1. 

•  ^direct  is  the  time  in  seconds  required  by  the  classical  algorithm. 

•  trari d  is  the  time  in  seconds  required  by  the  algorithm  of  the  present  paper. 

•  t direct / trand  is  the  factor  by  which  the  algorithm  of  the  present  paper  is  faster  than  the 
classical  algorithm. 

•  k  is  the  condition  number  of  AP^1,  the  preconditioned  version  of  the  matrix  A. 

•  i  is  the  number  of  iterations  required  by  the  preconditioned  conjugate  gradient  method 
to  yield  the  requested  precision  erei  of  .5E-14  or  better  in  Table  2,  and  .5E-10  or  better 
in  Table  4. 


•  £re\  is  defined  via  the  formula 

_  $  ^min 

Ael  7  i 

ka  •  0  min 


where  ka  is  the  condition  number  of  A  given  in  (36),  6 
vector  produced  by  the  randomized  algorithm),  and 


(38) 


\\Ax  —  6||  (x  is  the  solution 


Cm  i  n 


=  min  \\Ay  —  6||  =  10  3. 

ye€n 


(39) 


Remark  6.1  Standard  perturbation  theory  shows  that  erei  is  the  appropriately  normalized 
measure  of  the  precision  produced  by  the  algorithm;  see,  for  example,  formula  1.4.27  in  [3]. 

The  values  for  £rei  and  i  reported  in  the  tables  are  the  worst  (maximum)  values  encoun¬ 
tered  during  10  independent  randomized  trials  of  the  algorithm,  as  applied  to  the  same 
matrix  A.  The  values  for  trand  reported  in  the  tables  are  the  average  values  over  10  inde¬ 
pendent  randomized  trials.  None  of  the  quantities  reported  in  the  tables  varied  significantly 
over  repeated  randomized  trials. 

The  following  observations  can  be  made  from  the  examples  reported  here,  and  from  our 
more  extensive  experiments: 

1.  When  m  =  32768  and  n  =  512,  the  randomized  algorithm  runs  over  5  times  faster  than 
the  classical  algorithm  based  on  plane  (Householder)  reflections,  even  at  full  double 
precision. 

2.  As  observed  in  Remark  5.1,  our  choice  /  =  4n  seems  to  ensure  that  the  condition 
number  k  of  the  preconditioned  matrix  is  at  most  3.  More  generally,  k  seems  to  be 
less  than  a  function  of  the  ratio  l/n. 

3.  The  algorithm  of  the  present  paper  attains  high  precision  at  reasonably  low  cost. 


11 


7  Conclusions  and  generalizations 

This  article  provides  a  fast  algorithm  for  overdetermined  linear  least-squares  regression.  If 
the  matrices  A  and  A*  from  the  regression  involving  \\Ax  —  6||  can  be  applied  sufficiently 
rapidly  to  arbitrary  vectors,  then  the  algorithm  of  the  present  paper  can  be  accelerated 
further.  Moreover,  the  methods  developed  here  for  overdetermined  regression  extend  to 
underdetermined  regression. 

The  theoretical  bounds  in  Lemma  3.2,  Corollary  3.3,  and  Lemma  3.4  should  be  considered 
preliminary.  Our  numerical  experiments  indicate  that  the  algorithm  of  the  present  article 
performs  better  than  our  estimates  guarantee.  Furthermore,  there  is  nothing  magical  about 
the  subsampled  randomized  Fourier  transform  defined  in  (7).  In  our  experience,  several  other 
similar  transforms  appear  to  work  at  least  as  well,  and  we  are  investigating  these  alternatives 
(see,  for  example,  [2]). 
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Table  1: 


m 

n 

l 

^direct 

^rand 

^direct/  ^rand 

32768 

64 

256 

.14E+01 

.13E+01 

l.i 

32768 

128 

512 

.55E+01 

.27E+01 

2.0 

32768 

256 

1024 

.22E+02 

.59E+01 

3.7 

32768 

512 

2048 

.89E+02 

.15E+02 

5.7 

Table  2: 


m 

n 

l 

K 

i 

^rel 

32768 

64 

256 

2.7 

14 

.120E-15 

32768 

128 

512 

2.9 

14 

.132E-15 

32768 

256 

1024 

2.9 

14 

.429E-15 

32768 

512 

2048 

2.9 

13 

.115E-14 

Table  3: 


m 

n 

l 

^direct 

^rand 

^•direct/  ^rand 

2048 

256 

1024 

.12E+01 

.71E+00 

1.6 

4096 

256 

1024 

.25E+01 

.94E+00 

2.6 

8192 

256 

1024 

.51E+01 

.14E+01 

3.5 

16384 

256 

1024 

.10E+02 

.26E+01 

4.1 

32768 

256 

1024 

.22E+02 

.50E+01 

4.4 

65536 

256 

1024 

.49E+02 

.11E+02 

4.4 

Table  4: 


m 

n 

l 

K 

i 

^rel 

2048 

256 

1024 

2.2 

4 

.326E-10 

4096 

256 

1024 

2.6 

5 

.364E-10 

8192 

256 

1024 

2.7 

6 

.160E-10 

16384 

256 

1024 

2.8 

7 

.599E-11 

32768 

256 

1024 

2.9 

8 

.502E-11 

65536 

256 

1024 

2.9 

8 

.177E-11 
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